Francis Galton

mentions: 1 | type: Person

// recent coverage: 1 mention

04:24

2026-06-21

thewatershed.markpesce.com

ai-safety

Why Evals are Hard

AI evaluations are failing as models approach general intelligence, with benchmarks saturating through contamination and Goodhart effects while the scope of evaluation expands from minutes to months. …

// co-occurs with top 4 entities

Mark Pesce (1), University of Sydney (1), Fable 5 (1), MMLU (1)